Imagine you are the architect of a grand labyrinth. To find the shortest path from every single room to the exit, you don't need to walk them all. If you have a perfect map (the MDP dynamics), you can simply ask every room's neighbors how far they are from the goal and add your own travel cost. This is the essence of Dynamic Programming (DP).
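The labyrinth analogy can be sketched in a few lines of Python. This is a toy illustration, not any standard library API: the rooms, corridors, and unit step cost are all invented, and each room repeatedly asks its neighbors how far they are from the exit and adds its own travel cost.

```python
# A tiny made-up labyrinth: each room lists the rooms it connects to.
rooms = {"A": ["B"], "B": ["A", "C"], "C": ["B", "EXIT"], "EXIT": []}

# Distance-to-exit estimates; the exit is 0 steps from itself.
dist = {room: float("inf") for room in rooms}
dist["EXIT"] = 0.0

# Sweep until no estimate changes: the "ask your neighbors" update.
changed = True
while changed:
    changed = False
    for room, neighbors in rooms.items():
        if room == "EXIT":
            continue
        best = min(1.0 + dist[n] for n in neighbors)  # unit cost per move
        if best < dist[room]:
            dist[room] = best
            changed = True

print(dist)  # {'A': 3.0, 'B': 2.0, 'C': 1.0, 'EXIT': 0.0}
```

No room is ever "walked" end to end; the shortest distances emerge purely from local updates against the map, which is exactly the move DP makes with a known MDP model.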
The Theoretical Pillars
- Full Backups: Unlike sampling methods that roll the dice on a single trajectory, DP performs a full backup. It looks at every possible successor state $s'$ and every possible reward $r$, weighting them by the transition probability $p(s', r | s, a)$.
- Bootstrapping: DP is built on a leap of faith. We update our estimate of a state's value $V(s)$ using the current estimates of its successors' values $V(s')$. We don't wait for complete returns; we refine our guesses based on other guesses.
- Optimal Fixed Points: For a discounted finite MDP, the Bellman optimality equation has a unique solution $v_*$: the Bellman optimality operator is a contraction, so it admits exactly one fixed point. Iterative DP is the mathematical engine that drives our current value function toward it.
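All three pillars can be seen at once in value iteration. The sketch below uses an invented two-state MDP (the state names, actions, transition probabilities, and rewards are illustrative assumptions, not from the text): each sweep does a full backup over $p(s', r \mid s, a)$, bootstraps on the current $V(s')$, and stops when the estimates settle at the fixed point.

```python
import itertools

# A made-up two-state MDP for illustration.
# P[s][a] = list of (probability, next_state, reward) outcomes.
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in P}
for sweep in itertools.count():
    delta = 0.0
    for s in P:
        # Full backup: an expectation over every (s', r) pair,
        # weighted by p(s', r | s, a) -- no sampling.
        # Bootstrapping: the target uses the current estimates V[s'].
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:  # estimates stopped moving: we are at the fixed point
        break

print({s: round(v, 3) for s, v in V.items()})
```

Because the Bellman optimality operator is a contraction (for $\gamma < 1$), each sweep shrinks the distance to $v_*$, so the loop is guaranteed to terminate at the unique fixed point regardless of how $V$ was initialized.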